Welcome to my humble landing page and to the Introduction to Open Data Science course, which I'm taking together with the others!
Data science is growing in importance, and on this course we will learn related practical skills and, of course, have fun while doing it!
Here is a link to my GitHub repository
This week I’ve worked with wrangling, tidying and interpreting questionnaire data and learned how to plot and test linear regression models. Seems like attitude is a main factor affecting the test scores among the variables included in the study. The week has been a good refresher of the previous statistics courses and I have learned more about using Rmarkdown for presenting the results from R.
The data in question is a survey of students’ study skills (ASSIST) and their attitudes towards statistics (SATS), gathered 3.12.2014 - 10.1.2015. Students’ approaches to studying were divided into deep, strategic and surface categories. Here are the descriptions from a presentation by Kimmo Vehkalahti:
Surface approach: memorizing without understanding, with a serious lack of personal engagement in the learning process.
Deep approach: intention to maximize understanding, with a true commitment to learning.
Strategic approach: the ways students organize their studying (apply any strategy that maximizes the chance of achieving the highest possible grades).
The points variable gives the points received in a statistics exam. Participants with 0 points (meaning they did not attend the exam) were filtered out of the data.
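The actual wrangling was done in a separate data script, but a minimal sketch of that filtering step on toy data (the data frame name `df` and its values are just for illustration) could look like this:

```r
# toy sketch of the exam-points filter: drop rows where points == 0
df <- data.frame(points = c(0, 22, 0, 31, 19))
df <- subset(df, points > 0)  # keep only students who attended the exam
nrow(df)  # 3 rows remain
```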
learning2014 <- read.csv("https://raw.githubusercontent.com/BreezewindX/IODS-project/master/data/learning2014.csv")
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
The data consists of seven variables and contains the survey results of 166 persons, of whom only 56 were men (maybe a bit small a sample). There were roughly twice as many female as male participants. The ages of the participants ranged from 17 to 55 years.
library(GGally)
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
From the matrix we can see that the variables that correlate most positively with exam points are attitude (0.437) and stra (0.146), while surf (-0.144) correlates negatively. Correlation typically refers to how close two variables are to having a linear relationship with each other.
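For reference, the correlation reported by ggpairs is the Pearson correlation coefficient:

\[r_{xy}=\frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}}\]

It ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship).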
There is not much correlation between the study approaches; the largest is -0.324 between ‘deep’ and ‘surf’ (which is notable, but logical, as a student relying on the surface approach is unlikely to also use the deep approach).
The distribution of age is heavily right-skewed (a long tail towards older ages), which means that a big portion of the participants were relatively young. The median age for men was a bit higher than for women.
The mean of attitude score for men is somewhat higher than for women and in general not that many men scored low in attitude. There seems to be not much difference in the use of deep study skills between genders and women seem to rely more on the strategic study skills.
We’ll choose the following linear model for the regression \[points_i = \beta_1+\beta_2 attitude_i+\beta_3 stra_i+\beta_4 surf_i+\epsilon_i\] where \(\beta_1\) is the intercept and \(\epsilon_i\) is the error term.
# create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
According to the multiple R-squared value, the predictors jointly explain only about 21% of the observed variance in the dependent variable. The F-statistic for the regression tells us that we can reject the null hypothesis that all the slope coefficients are zero, so we don’t have to abandon the model at this point.
The positive coefficient of 3.40 for the variable attitude is statistically significant even at the 0.001 significance level.
P-values for variables stra and surf are higher than 0.05, which means we cannot, at the 95% confidence level, reject the null hypothesis that their coefficients are zero (ceteris paribus). Thus they are candidates for discarding.
Let’s test the joint null hypothesis that the coefficients of stra and surf are both zero, \(H_0:\beta_3=\beta_4=0\).
library(car)
ss_results<- linearHypothesis(my_model, c("stra = 0", "surf = 0"))
print(ss_results)
## Linear hypothesis test
##
## Hypothesis:
## stra = 0
## surf = 0
##
## Model 1: restricted model
## Model 2: points ~ attitude + stra + surf
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 164 4641.1
## 2 162 4544.4 2 96.743 1.7244 0.1815
We get a p-value of 0.182 (> 0.05), which means we cannot reject the joint null hypothesis at the 95% confidence level.
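As a sanity check, the F-statistic in the table can be reproduced from the residual sums of squares of the restricted (\(RSS_R\)) and unrestricted (\(RSS_U\)) models, with \(q = 2\) restrictions, \(n = 166\) observations and \(k = 4\) estimated coefficients:

\[F=\frac{(RSS_R-RSS_U)/q}{RSS_U/(n-k)}=\frac{(4641.1-4544.4)/2}{4544.4/162}\approx 1.72\]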
Let’s now drop these two ‘redundant’ variables. The new regression model is \[points_i = \beta_1 +\beta_2 attitude_i+\epsilon_i\]
# create a regression model without redundant variables
new_model <- lm(points ~ attitude, data = learning2014)
# print out a summary of the model
summary(new_model)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Now the multiple \(R^2\) is almost the same as before, even though we dropped two variables. This is notable because adding variables can only increase \(R^2\); the simpler model explains nearly as much with fewer parameters, so it is preferable for parsimony. The residual standard error is essentially unchanged (5.32 vs. 5.296). The single explanatory variable explains ~19% of the variance in the exam points.
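The adjusted \(R^2\), which penalises for the number of estimated coefficients \(k\), can be checked by hand for the new model (\(n = 166\), \(k = 2\)):

\[\bar{R}^2=1-(1-R^2)\frac{n-1}{n-k}=1-(1-0.1906)\cdot\frac{165}{164}\approx 0.186\]

which matches the 0.1856 in the output.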
The assumptions made about the errors are:
1. errors are normally distributed
2. errors are not correlated
3. the errors have constant variance (homoscedasticity)
\[\epsilon\sim N(0,\sigma^2) \]
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(new_model, which = c(1, 2, 5))
Looking at the Q-Q plot, the assumption of normality of the residuals seems reasonable, as most of the points lie on the line, even if some deviations can be seen at the tails.
No obvious pattern emerges in the residuals vs. fitted values plot (the variance seems constant), and none of the observations has much leverage over the whole regression.
Visual inspection thus implies that the assumptions are met. It is, however, a bit hard to say for sure that the variance is constant, so a few tests might be in order to confirm this.
We can test for heteroscedasticity using the Breusch-Pagan test, testing against the \(H_0\) of homoscedasticity.
library(lmtest)
library(dplyr)
new_model %>%
lmtest::bptest() -> bpresults
print(bpresults)
##
## studentized Breusch-Pagan test
##
## data: .
## BP = 0.0039163, df = 1, p-value = 0.9501
We get a p-value of 0.95, so we cannot reject the null hypothesis - we’ve found no evidence against homoscedasticity.
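Under the hood, the studentized Breusch-Pagan statistic is \(n\) times the \(R^2\) of an auxiliary regression of the squared residuals on the explanatory variables. A self-contained sketch on simulated homoscedastic data (the variable names and numbers here are illustrative, not from the course data):

```r
# manual studentized Breusch-Pagan test on simulated data
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)
u2  <- residuals(fit)^2                        # squared residuals
aux <- lm(u2 ~ x)                              # auxiliary regression
bp_stat <- length(x) * summary(aux)$r.squared  # BP = n * R^2, ~ chi-sq(df = 1) under H0
p_val   <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
```

With homoscedastic errors the statistic should be small and the p-value large, as it was for our model.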
The White test is a generalization of the Breusch-Pagan test and may detect more general forms of heteroscedasticity.
whiteresults <- bptest(new_model, ~ attitude + I(attitude^2), data = learning2014)
print(whiteresults)
##
## studentized Breusch-Pagan test
##
## data: new_model
## BP = 0.088873, df = 2, p-value = 0.9565
With the White test we get a p-value of 0.957, which means that we have not found any evidence against the assumptions of our linear regression model.
The data come from two identical questionnaires on secondary school student alcohol consumption in Portugal. The data attributes include student grades and demographic, social and school-related features, and the data was collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). You can find more information here.
The following adjustments have been made:
1. The variables not used for joining the two datasets have been combined by averaging (including the grade variables)
2. alc_use is the average of Dalc and Walc
3. high_use is TRUE if alc_use is higher than 2 and FALSE otherwise
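Steps 2 and 3 can be sketched on a toy data frame (the real join and wrangling live in a separate data script; `alc_toy` and its values are made up for illustration):

```r
# toy sketch of the alc_use / high_use derivation
alc_toy <- data.frame(Dalc = c(1, 3, 5), Walc = c(1, 4, 5))
alc_toy$alc_use  <- (alc_toy$Dalc + alc_toy$Walc) / 2  # average of weekday and weekend use
alc_toy$high_use <- alc_toy$alc_use > 2                # TRUE when average use exceeds 2
alc_toy$high_use  # FALSE TRUE TRUE
```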
Let’s take a look at the names of the variables in our dataset:
alc <- read.csv("https://raw.githubusercontent.com/BreezewindX/IODS-project/master/data/alc.csv")
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
I’ll choose high_use as the dependent variable and failures, absences and sex as explanatory variables.
My intuition is that
* a high use of alcohol is likely related to absences and also to failures.
* drinking might lead to failing, and failing might induce even more drinking.
* absences might be an indicator of failures by themselves, even without high alcohol consumption, as there can be multiple reasons for absences.
* alcohol consumption habits might differ between men and women.
The logistic regression model is \[logit(p)=\log\frac{p}{1-p}=\beta_1+\beta_2 failures+\beta_3 absences+\beta_4 sexM\] where \(p\) is the probability that high_use is TRUE. (A logistic model has no separate additive error term; the randomness comes from the binary outcome itself.)
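Solving the logit for \(p\) gives the predicted probability of high use, which is what predict() later returns with type = "response":

\[p=\frac{1}{1+e^{-(\beta_1+\beta_2 failures+\beta_3 absences+\beta_4 sexM)}}\]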
library(ggplot2)
ggplot(alc, aes(x = high_use, fill = sex)) +
geom_bar(position = "fill")
The majority of the students with high alcohol consumption were men.
library(ggpubr)
library(dplyr)
dens_score<- ggplot(alc, aes(x = G3)) +
geom_density()
x <- seq(1, 18, length.out=100)
df <- with(alc, data.frame(x = x, y = dnorm(x, mean(G3), sd(G3))))
dens_score2<-dens_score + geom_line(data = df, aes(x = x, y = y), color = "red")
dens_fail <- ggplot(alc, aes(x = failures, fill = high_use)) +
geom_bar(position="fill")
dens_abs <- ggplot(alc, aes(x = absences, fill = high_use)) +
geom_bar(position = "fill")
dens2_abs <- ggplot(alc, aes(x = absences, fill = high_use)) +
geom_histogram()
# facet_wrap(~ high_use)
ggarrange(dens_score2, dens_fail, dens_abs, dens2_abs,
labels = c("A", "B", "C", "D"),
ncol = 2, nrow = 2)
From A we can see that the scores are somewhat, even if not perfectly, normally distributed (the red curve is a normal density with the same mean and standard deviation).
B shows, for each number of failures, the shares of students with high and low alcohol consumption. We can see that the share of students with high alcohol use grows as the number of failures grows. The findings support the hypothesis that high alcohol consumption goes together with a higher risk of failure.
Similarly, in C the share of students with high alcohol consumption grows as the number of absences grows. This supports the view that students with high alcohol consumption have more absences than students with low alcohol consumption.
D is graph C in absolute counts, where we can see that the number of absences tapers off; there are some outliers in the data.
library(ggplot2)
# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade")
The distribution of grades seems more symmetric among students with low alcohol consumption, and the grades of both sexes have more spread there than in the high-consumption case. Men’s grades are clearly more affected by heavy consumption than women’s, as the mean grade for women is almost the same in both cases. For men, there are outliers who received bad grades regardless of how much they drink.
# initialise a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")
Absences seem to increase when alcohol consumption is high, which makes sense, as high consumption has a tendency to make you feel like crap the next morning. Again, we can see that men’s absences are more affected by drinking. Women have more absences than men when consuming little, and there is a pretty high number of absences among the outliers in the women’s data.
library(GGally)
vars <- c("high_use","failures","absences","sex")
ggpairs(alc, columns = vars, mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist")))
Last but not least, we can see that there’s not much correlation between failures and absences. There were roughly equal numbers of both sexes in the data, even if there were slightly more women.
# find the model with glm()
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1855 -0.8371 -0.6000 1.1020 2.0209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.90297 0.22626 -8.411 < 2e-16 ***
## failures 0.45082 0.18992 2.374 0.017611 *
## absences 0.09322 0.02295 4.063 4.85e-05 ***
## sexM 0.94117 0.24200 3.889 0.000101 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 424.40 on 378 degrees of freedom
## AIC: 432.4
##
## Number of Fisher Scoring iterations: 4
#Round the coefficents for cleaner look
clean <- round(coef(m), digits=3)
print(clean)
## (Intercept) failures absences sexM
## -1.903 0.451 0.093 0.941
mean(m$residuals)
## [1] 0.02143222
Here we can see the summary of the model. All explanatory variables are statistically significant at the 0.05 level; absences and sex are significant even at the 0.001 level.
For every one unit change in failures, the log odds of high_use (versus low-use) increases by 0.451.
For every one unit change in absences, the log odds of high_use (versus low-use) increases by 0.093.
If the person in question is a male, the log odds of high_use increases by 0.941.
Therefore this fitted model says that, holding failures and absences at a fixed value, the odds of high alcohol consumption for a male (sexM = 1) over the odds for a female (sexM = 0) are exp(0.941166) ≈ 2.563. In terms of percent change, the odds for men are about 156% higher than the odds for women. The coefficient for failures says that, holding absences and sex at a fixed value, we will see a 57% increase in the odds of high alcohol consumption for a one-unit increase in failures, since exp(0.4508198) ≈ 1.57. The variable absences has the same kind of interpretation.
Residuals of the regression are close to mean zero, which is what we would want.
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1491257 0.09395441 0.228611
## failures 1.5695984 1.08339644 2.294737
## absences 1.0977032 1.05169654 1.150848
## sexM 2.5629682 1.60381392 4.149405
Here we can see the odds ratios for the variables and their corresponding confidence intervals. Note that for logistic models, confidence intervals are based on the profiled log-likelihood function.
Following is an excerpt from here.
An odds ratio measures the association between a predictor variable (x) and the outcome variable (y). It represents the ratio of the odds that an event will occur (event = 1) given the presence of the predictor x (x = 1), compared to the odds of the event occurring in the absence of that predictor (x = 0). For a given predictor (say x1), the associated beta coefficient (b1) in the logistic regression function corresponds to the log of the odds ratio for that predictor. If the odds ratio is 2, then the odds that the event occurs (event = 1) are two times higher when the predictor x is present (x = 1) versus x is absent (x = 0).
From the odds ratios we have here we can say that it’s more likely that student has high alcohol consumption if there are failures present and if the student is a male the odds are over 2.5 times as high as if they were female.
As seen above, each odds ratio is simply the exponential of the corresponding regression coefficient.
The confidence interval can be interpreted as follows: if we repeated the sampling many times, intervals constructed this way would contain the true coefficient about 95% of the time. The confint function shows a two-tailed interval, from the 2.5% point to the 97.5% point of the relevant distribution, which form the lower and upper limits of the interval.
Let’s recap my intuition from the beginning of this analysis part:
Seems like failures and high use of alcohol are related. We can see this from the statistically significant coefficients of the regression.
Absences are also statistically significant, but their presence does not raise the odds that much. There are probably many other reasons for being absent than alcohol intake.
The data analysis would seem to support the view that men are more likely to have high alcohol consumption in the study group, and that with high alcohol consumption they also get lower grades and are absent more.
Target variable versus the predictions (2x2) tabulation
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 259 9
## TRUE 84 30
We can see that the model predicted quite accurately when high_use was FALSE: only 9 predictions out of 268 were wrong.
However, it did not do so well when high_use was TRUE, as the prediction was wrong 84 times out of 114. A graphical representation follows.
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67801047 0.02356021 0.70157068
## TRUE 0.21989529 0.07853403 0.29842932
## Sum 0.89790576 0.10209424 1.00000000
Computing the total proportion of inaccurately classified individuals (= the training error):
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss <- loss_func(class = alc$high_use, prob = alc$probability)
print(loss)
## [1] 0.2434555
On average, our model predicts the outcome wrong in approx. 24% of the cases in the training data.
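This agrees with the confusion matrix above: the misclassified cells divided by the total number of observations give

\[\frac{9+84}{259+9+84+30}=\frac{93}{382}\approx 0.2435\]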
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2565445
The average proportion of wrong predictions in the cross-validation is approximately 0.26. My chosen model is exactly the same as in the DataCamp exercise, so of course the prediction error is the same. When comparing different models, we would prefer the one with the smaller penalty (i.e. error). If one could re-select the explanatory variables so that the error were smaller, that model could be preferred for prediction over this one. Note that adding more variables always lowers the training error, but can increase the cross-validation error through overfitting.
library(MASS)
# load the data
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
Boston dataframe has 506 observations of 14 variables:
* crim - per capita crime rate by town
* zn - proportion of residential land zoned for lots over 25,000 sq.ft.
* indus - proportion of non-retail business acres per town
* chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* nox - nitrogen oxides concentration (parts per 10 million)
* rm - average number of rooms per dwelling
* age - proportion of owner-occupied units built prior to 1940
* dis - weighted mean of distances to five Boston employment centres
* rad - index of accessibility to radial highways
* tax - full-value property-tax rate per $10,000
* ptratio - pupil-teacher ratio by town
* black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* lstat - lower status of the population (percent)
* medv - median value of owner-occupied homes in $1000s

More information about the dataset can be found here.
# plot matrix of the variables
pairs(Boston)
The pairs graph is a mess so far, so let’s see if we can discard some variables and then maybe try drawing it again…
Let’s draw the correlations between the variables.
library("tidyverse")
library("dplyr")
library("corrplot")
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits=2)
# print the correlation matrix
#print(cor_matrix)
# visualize the correlation matrix
corrplot.mixed(cor_matrix, tl.cex = 0.7, number.cex = .8)
Let’s extract the most correlated variable pairs (|r| > 0.7) from the matrix:
library(reshape2)
CM <- cor_matrix # Make a copy of the matrix
CM[lower.tri(CM, diag = TRUE)] <- NA # lower tri and diag set to NA
subset(melt(CM, na.rm = TRUE), abs(value) > .7)
## Var1 Var2 value
## 59 indus nox 0.76
## 89 nox age 0.73
## 101 indus dis -0.71
## 103 nox dis -0.77
## 105 age dis -0.75
## 129 indus tax 0.72
## 135 rad tax 0.91
## 195 lstat medv -0.74
Variables rad and tax are highly positively correlated (0.91). This can be interpreted so that properties with better access to the city’s radial highways also have a higher property-tax rate.
Next are nox and dis with a -0.77 negative correlation. The interpretation is that the smaller the weighted average distance to the five Boston employment centres, the larger the nitrogen oxides concentration (worse air pollution).
Third, with a 0.76 positive correlation, come indus and nox. The interpretation is that the higher the proportion of non-retail business acres in a town, the higher the concentration of nitrogen oxides.
(One thing that can be noted is that the status of the inhabitants and the number of rooms correlate strongly with the housing value.)
For convenience, let’s plot histograms only for these variables of interest.
library(ggpubr)
library(ggplot2)
x <- seq(1, length.out=dim(Boston)[1])
rad_plot<-ggplot(Boston, aes(x=rad)) +
geom_histogram()
tax_plot <- ggplot(Boston, aes(x=tax)) +
geom_histogram()
nox_plot <- ggplot(Boston, aes(x=nox)) +
geom_histogram()
dis_plot <- ggplot(Boston, aes(x=dis)) +
geom_histogram()
indus_plot <- ggplot(Boston, aes(x=indus)) +
geom_histogram()
ggarrange(rad_plot, tax_plot, nox_plot, dis_plot, indus_plot,
labels = c("A", "B", "C", "D", "E"),
ncol = 3, nrow = 2)
Plot A: rad is the index of accessibility to radial highways. There are two concentrations in the graph: the first around index 4-5 and the second at index 24. There are 132 observations in the latter (index value 24), which is a high share of all observations. It would make sense to check a map of Boston to better understand these outliers.
Plot B: tax shows the full-value property-tax rate, and the distribution is pretty similar to plot A, i.e. a concentration at low values and a large group of outliers at the top (132 observations with value 666).
Plots C and D (nox and dis) have also similar distributions skewed to the right.
Plot E indus is skewed to the right, and has a high count (132 observations) on a single value (18.1).
There seems to be 132 observations in the data, that are outliers for some reason.
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
The scaled dataset is of type matrix, so we’ll convert it into a data frame. In the new data frame all the variables have zero mean and unit standard deviation.
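The scale() function standardises each column \(j\) by subtracting its mean and dividing by its standard deviation:

\[x^{scaled}_{ij}=\frac{x_{ij}-\bar{x}_j}{s_j}\]

This puts all the variables on a comparable scale, which matters for the distance-based methods used below.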
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
Linear discriminant analysis produces results based on the assumptions that the variables are normally distributed (within each class) and that the classes share the same covariance matrix.
Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))
# look at the table of the new factor crime
#table(crime)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
#lda.fit
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
From the graph we can see the variables that imply high crime, the most important factor being rad (the arrows pointing at the high crime cluster).
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 10 0 0
## med_low 3 16 8 0
## med_high 0 7 20 1
## high 0 0 0 25
The model seems to predict high crime very well (all 25 are correct), and looking at the predictions as a whole, the model does better on the higher rates. In every class, the model predicted correctly more often than not.
First we calculate a distance matrix with the Euclidean metric, and then with the Manhattan metric.
library(factoextra)
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# euclidean distance matrix
dist_eu <- get_dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method="manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
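For two observations \(x\) and \(y\), the two metrics are defined as:

\[d_{euc}(x,y)=\sqrt{\sum_j (x_j-y_j)^2},\qquad d_{man}(x,y)=\sum_j |x_j-y_j|\]

The Manhattan distance between two points is always at least as large as the Euclidean distance, which matches the larger values in its summary above.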
############
# K-means #
############
k_max <- 10
set.seed(123)
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
From the graph we choose two centers, as that is the point where the slope of the line changes from steep to flat (the elbow method). Let’s run k-means with two centers.
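The quantity plotted is the total within-cluster sum of squares, which k-means tries to minimise for a given number of clusters \(K\):

\[TWCSS=\sum_{k=1}^{K}\sum_{x_i\in C_k}\lVert x_i-\mu_k\rVert^2\]

where \(\mu_k\) is the centroid of cluster \(C_k\). The elbow is the point where adding another cluster no longer reduces TWCSS much.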
# k-means clustering
set.seed(123)
km <-kmeans(boston_scaled, centers = 2)
library(tidyverse)
library(data.table)
boston_scaled %>%
mutate(Cluster = km$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean") %>%
DT::datatable(extensions = 'FixedColumns', options = list(dom = 't',
scrollX = TRUE,
scrollCollapse = TRUE))
You can scroll the datatable to see the rest of the variables, which are summarised by their means grouped by cluster number. We can see that there are differences between the two clusters: e.g. cluster #2 has more crime, a bigger proportion of non-retail business acres, more pollution and older buildings; it’s further from the employment centres, has fewer teachers per pupil, and the median value of owner-occupied homes is lower.
Let's draw a scatterplot matrix coloured by the two clusters:
# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
Red is cluster #2 and black is the nicer-neighbourhood cluster #1.
The graph confirms what we earlier saw numerically in the datatable. The top row plots crime against the other variables, and cluster #2 has higher crime rates throughout.
# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[1:5], col = km$cluster)
If we compare crim and indus, we see that the red cluster has higher crime and is more industrialised than the black one. The same holds for nox, so it is logical that there is also more air pollution. zn shows that the red cluster has hardly any residential land zoned for large lots, unlike the black cluster.
We could go on analysing all the variables in the graph, but I'm fairly sure that is not the point of this exercise.
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
set.seed(123)
k3 <- kmeans(boston_scaled, centers = 3)
k4 <- kmeans(boston_scaled, centers = 4)
k5 <- kmeans(boston_scaled, centers = 5)
k6 <- kmeans(boston_scaled, centers = 6)
k7 <- kmeans(boston_scaled, centers = 7)
# plots to compare
p3 <- fviz_cluster(k3, geom = "point", data = boston_scaled) + ggtitle("k = 3")
p4 <- fviz_cluster(k4, geom = "point", data = boston_scaled) + ggtitle("k = 4")
p5 <- fviz_cluster(k5, geom = "point", data = boston_scaled) + ggtitle("k = 5")
p6 <- fviz_cluster(k6, geom = "point", data = boston_scaled) + ggtitle("k = 6")
p7 <- fviz_cluster(k7, geom = "point", data = boston_scaled) + ggtitle("k = 7")
library(gridExtra)
grid.arrange(p3, p4, p5, p6, p7, nrow = 3)
I choose k=3 as it has the best separation in my opinion.
set.seed(123)
km <- boston_scaled %>%
kmeans(centers = 3)
# fit LDA with the k-means clusters as the target classes (lda() is from MASS)
lda.fit <- lda(km$cluster ~ ., data = boston_scaled)
classes <- as.numeric(km$cluster)
plot(lda.fit, dimen = 2, col = classes)
lda.arrows(lda.fit, myscale = 4)  # biplot-arrow helper defined earlier
Rad and tax (and maybe indus) are the most influential linear separators for cluster #2, age and chas for cluster #1, and dis and rm for cluster #3. Black pulls towards clusters #1 and #3. Selecting the right number of clusters seems important, so it is useful to know some methods for choosing an optimal number (for example the elbow method we used earlier).
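The arrow directions and lengths come from the LDA coefficient matrix, so the same conclusion can be checked numerically by printing the loadings of `lda.fit`: the variables with the largest absolute values on LD1 and LD2 dominate the separation in the plot.

```r
# discriminant coefficients behind the biplot arrows
round(lda.fit$scaling, 2)
```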
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
set.seed(123) #Setting seed
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))
# look at the table of the new factor crime
#table(crime)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# linear discriminant analysis on the train set
lda.fit <- lda(crime ~ ., data = train)
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
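As a check on the projection: multiplying the 404 x 13 predictor matrix by the 13 x 3 scaling matrix expresses every training observation in LD1..LD3 coordinates, so the result should have 404 rows and 3 columns.

```r
# each row is one training observation projected onto the three discriminants
dim(matrix_product)
head(matrix_product)
```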
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
set.seed(123)
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
myset <- boston_scaled[ind,]
km <- kmeans(myset, centers = 2)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)
We can see that one cluster roughly covers the high and med_high crime points, and the other the lower-crime points. Otherwise there are no differences between the two graphs, because I set the seed and both use the same data. I'm using the optimal two centers for the clusters here.
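That visual impression can be quantified by cross-tabulating the two k-means clusters against the crime categories of the same rows. This sketch assumes the `train` object from the LDA section is still in the workspace (it was built from the same `ind` rows as `myset`, so the rows line up):

```r
# how the two clusters line up with the four crime categories
table(cluster = km$cluster, crime = train$crime)
```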